Broken title search ut #411

nadolskit · 2024-09-14T06:10:28Z

Summary of Problem
The expected and asserted citation counts did not match, which is what caused this UT to fail.

I noticed that if I switched the order of sources from semantic_scholar, crossref to crossref, semantic_scholar, the citation count differed even more dramatically (similar to the results James noticed earlier).

More importantly, I looked up the papers on Semantic Scholar: the count there (25) doesn't match our expected value (23), so that failure makes sense.
The following assertion is failing for the same reason: 196 asserted vs 191 expected.

Vibe checked the results of this change via a CI run: https://github.com/Future-House/paper-qa/actions/runs/10859898582/job/30139826477

This does not fix the issue that James found earlier this evening in test_agents.py.

jamesbraza · 2024-09-14T06:15:18Z

tests/utils/paper_helpers.py

+    citation_pattern = r"(This article has )\d+( citations?)"
+
+    # between group 1 and 2, replace with the character "n"
+    expected_cleaned = re.sub(citation_pattern, r"\1n\2", expected).strip()
+    actual_cleaned = re.sub(citation_pattern, r"\1n\2", actual).strip()


I think you got the capturing groups wrong. You want r"This article has (\d+) citations?"

We don't care about extracting the text; we want to extract the number.

Then you can directly compare the ints, with no need for the weird r"\1n\2".

That's fair.
The original approach used \1n\2 because it compares the structure of the citation text while ignoring the citation count itself, i.e. whatever number is there.

But you make a valid point. I'll get this updated!

Fwiw (I am not sure of your experience level with regex), going from r"(This article has )\d+( citations?)" to r"This article has (\d+) citations?" will still match the surrounding text too. A capturing group () just makes it easy to capture and extract a value.

And plz point out if we're on the same page already haha. Hope you have a good night

I believe I've addressed this accordingly. Thx James!
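
For readers following the thread, here is a minimal sketch of the two regex approaches being discussed. Only the two patterns come from the thread; the sample citation strings and variable names are made up for illustration.

```python
import re

# Made-up sample strings; only the two patterns below come from the thread.
expected = "Doe, J. Some Title. This article has 23 citations."
actual = "Doe, J. Some Title. This article has 25 citations."

# Approach A: keep the surrounding text in groups 1 and 2, mask the number with "n".
mask_pattern = r"(This article has )\d+( citations?)"
expected_masked = re.sub(mask_pattern, r"\1n\2", expected).strip()
actual_masked = re.sub(mask_pattern, r"\1n\2", actual).strip()

# Approach B: capture only the number, then read it back as an int.
count_pattern = r"This article has (\d+) citations?"
match = re.search(count_pattern, actual)
actual_count = int(match.group(1)) if match else None
```

Approach A answers "is the text the same apart from the count?", while approach B answers "what is the count?". Both patterns match the same span of text; they differ only in what the groups capture.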

jamesbraza · 2024-09-14T06:16:21Z

tests/utils/paper_helpers.py

+    :return: True if the citations match except for the citation count, False otherwise.
+    """
+    # https://regex101.com/r/lCN8ET/1
+    citation_pattern = r"(This article has )\d+( citations?)"


This regex is closely coupled to DocDetails. Can you move this regex to a ClassVar[str] on DocDetails?
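
A rough sketch of what this request could look like, assuming nothing about the real DocDetails beyond its name. `DocDetailsSketch`, `CITATION_COUNT_PATTERN`, and `citations_match_ignoring_count` are hypothetical names used only for illustration.

```python
import re
from typing import ClassVar


class DocDetailsSketch:
    """Hypothetical stand-in for DocDetails; not the real paperqa class."""

    # Class-level constant shared by all instances, not a per-instance field.
    CITATION_COUNT_PATTERN: ClassVar[str] = r"(This article has )\d+( citations?)"

    def __init__(self, formatted_citation: str) -> None:
        self.formatted_citation = formatted_citation


def citations_match_ignoring_count(expected: str, actual: str) -> bool:
    """Return True if the citations match except for the citation count."""
    pattern = DocDetailsSketch.CITATION_COUNT_PATTERN
    return (
        re.sub(pattern, r"\1n\2", expected).strip()
        == re.sub(pattern, r"\1n\2", actual).strip()
    )
```

Keeping the pattern on the class that produces the citation text means a change to the citation wording and a change to the pattern happen in the same place.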

mskarlin · 2024-09-14T16:05:31Z

The minor variations (e.g., 196 vs 191) are expected, because the papers will get more citations over time. The bigger question is why the cassettes aren't being used, because this should come back with a recorded request each time. I'm in favor of root-causing that before we merge this. Maybe we regenerated these cassettes but didn't update the test?
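
To illustrate the expectation here, below is a toy model of replay-only behavior. It is a plain-Python stand-in, not the API of any cassette library, and the URL and response text are invented.

```python
class ReplayOnlyCassette:
    """Toy stand-in for a recorded-HTTP cassette; not a real library's API."""

    def __init__(self, recorded: dict[str, str]) -> None:
        self._recorded = recorded
        self.play_count = 0

    def get(self, url: str) -> str:
        # In replay-only mode an unrecorded request is an error, not a live call.
        if url not in self._recorded:
            raise LookupError(f"No recorded response for {url}")
        self.play_count += 1
        return self._recorded[url]


# Invented URL and payload, for illustration only.
cassette = ReplayOnlyCassette(
    {"https://example.invalid/paper": "This article has 191 citations."}
)
first = cassette.get("https://example.invalid/paper")
second = cassette.get("https://example.invalid/paper")
```

If every request is replayed like this, the citation count cannot drift between runs, so a drifting count suggests that some requests are reaching the live service.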

jamesbraza · 2024-09-14T19:45:45Z

tests/test_clients.py

+        assert (
+            expected_citation_format is not None
+        ), "Expected string should match the citation pattern"
+        assert (
+            actual_citation_format is not None
+        ), "Actual string should match the citation pattern"


Suggested change

-        assert (
-            expected_citation_format is not None
-        ), "Expected string should match the citation pattern"
-        assert (
-            actual_citation_format is not None
-        ), "Actual string should match the citation pattern"
+        assert (
+            expected_citation_format
+        ), "Expected string should match the citation pattern"
+        assert (
+            actual_citation_format
+        ), "Actual string should match the citation pattern"

This may resolve the union-attr ignore a few lines below
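
As a small sketch of the behavior this suggestion relies on: `re.search` returns either a match object or None, and a match object is always truthy, so a bare `assert match` behaves like `assert match is not None` at runtime. The helper name and sample text below are assumptions for illustration, and whether this removes the particular ignore in the diff depends on which operand the type checker was flagging.

```python
import re

# Pattern taken from the review thread; the sample text is made up.
CITATION_PATTERN = r"This article has \d+ citations?"


def citation_span(text: str) -> tuple[int, int]:
    """Return (start, end) of the citation-count phrase in text."""
    match = re.search(CITATION_PATTERN, text)  # a match object or None
    # A match object is always truthy, so this check behaves like
    # `is not None` and lets a type checker rule out None afterwards.
    assert match, "String should match the citation pattern"
    return match.start(), match.end()


sample = "Some Title. This article has 25 citations."
start, end = citation_span(sample)
```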

jamesbraza · 2024-09-14T19:50:16Z

tests/test_clients.py

+        expected_remaining = (
+            paper_attributes["formatted_citation"][: expected_citation_format.start()]
+            + paper_attributes["formatted_citation"][expected_citation_format.end() :]
+        )
+
+        actual_remaining = (
+            details.formatted_citation[: actual_citation_format.start()]  # type: ignore[union-attr]
+            + details.formatted_citation[actual_citation_format.end() :]  # type: ignore[union-attr]
+        )
+
+        # Assert that the parts of the strings outside the citation count are identical
+        assert (
+            expected_remaining == actual_remaining
+        ), "Formatted citation text should match except for citation count"


So we are doing a regex search to find a region, then removing that region, and lastly doing an equality check on the front/back end.

I think it would be simpler to do one regex removal of the region, and directly equality compare the remnants.

I also think this is what you originally had 😅 😓 hahaha sorry! I actually realized I misunderstood your original logic last night, my bad here.

What do you think?

definitely! no worries, communicating through PRs can be hard sometimes. Thank you!! adjusting now.

Before I merge this, I'm also going to look into why the cassette files haven't been updated.
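
A minimal sketch comparing the two approaches from this thread, assuming the citation phrase appears once per string. The function names and sample strings are hypothetical; the pattern follows the one in the diff without its capturing groups.

```python
import re

CITATION_COUNT = r"This article has \d+ citations?"


def remove_by_slicing(text: str) -> str:
    """Search for the region, then stitch together the text around it."""
    match = re.search(CITATION_COUNT, text)
    assert match, "String should match the citation pattern"
    return text[: match.start()] + text[match.end() :]


def remove_by_sub(text: str) -> str:
    """Remove the first occurrence of the region in a single substitution."""
    return re.sub(CITATION_COUNT, "", text, count=1)


# Made-up sample strings for illustration.
expected = "T. This article has 191 citations."
actual = "T. This article has 196 citations."
same_ignoring_count = remove_by_sub(expected) == remove_by_sub(actual)
```

One difference worth noting: the slicing version fails loudly when the pattern is absent, while `re.sub` silently returns the input unchanged, so a test built on substitution alone would not detect a citation string that has lost its count phrase.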

jamesbraza

Not sure why some requests were missing from fixtures, but otherwise LGTM and thanks for doing this

nadolskit added 3 commits September 13, 2024 21:05

Changes assertion citations to match output, testing CI

0db661c

Ignore specific citation count in UT, build helper function for other…

092e7c4

… UTs

Small grammar adjustment

013afde

nadolskit requested review from whitead, jamesbraza and mskarlin September 14, 2024 06:10

dosubot bot added the size:S (10-29 lines changed) and bug labels Sep 14, 2024

jamesbraza reviewed Sep 14, 2024

change implementation, update UTs, cleanup, lint

669988b

dosubot bot added the size:M (30-99 lines changed) label and removed the size:S label Sep 14, 2024

jamesbraza reviewed Sep 14, 2024

Simplify regex expression, simplify test, add clarifying comment

533fe9a

dosubot bot added the size:S (10-29 lines changed) label and removed the size:M label Sep 14, 2024

remove regex solution, replaced with logic that handles undefined val…

145af72

…ues. Pull main.

dosubot bot added the size:XL (500-999 lines changed) label and removed the size:S label Sep 16, 2024

nadolskit added 4 commits September 16, 2024 11:19

remove vestigial print statemen

5ff4022

lint

c23161e

correctly disable rule

88671f7

Merge branch 'main' into broken-title-search-UT

dae344e

jamesbraza reviewed Sep 16, 2024

paperqa/types.py

nadolskit added 4 commits September 16, 2024 11:58

Add clarifying comment

5f86258

Merge branch 'main' into broken-title-search-UT

a66b68f

Merge branch 'main' of https://github.com/Future-House/paper-qa into …

6aca298

…broken-title-search-UT

main version of journalquality

5d8e38c

dosubot bot removed the size:XL label Sep 16, 2024

dosubot bot added the size:L (100-499 lines changed) label Sep 16, 2024

jamesbraza approved these changes Sep 16, 2024

paperqa/types.py (outdated)

dosubot bot added the lgtm label (approved by a maintainer) Sep 16, 2024

Update paperqa/types.py

281a849

Co-authored-by: James Braza <[email protected]>

nadolskit merged commit 2f7e462 into main Sep 16, 2024
5 checks passed

nadolskit deleted the broken-title-search-UT branch September 16, 2024 20:19
